Skip to content

OCPBUGS-76530: Fix intermittent etcd peer communication failures#8479

Open
vsolanki12 wants to merge 2 commits into
openshift:mainfrom
vsolanki12:OCPBUGS-76530-etcd-peer-cert-san
Open

OCPBUGS-76530: Fix intermittent etcd peer communication failures#8479
vsolanki12 wants to merge 2 commits into
openshift:mainfrom
vsolanki12:OCPBUGS-76530-etcd-peer-cert-san

Conversation

@vsolanki12

@vsolanki12 vsolanki12 commented May 11, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

Both etcd-discovery and etcd-client headless services select the same etcd pods (app: etcd), causing CoreDNS to register two PTR records per pod IP. The etcd peer certificate only included *.etcd-discovery wildcard SANs, so when Go's getnameinfo() non-deterministically returned the etcd-client PTR record during peer TLS verification, the SAN check failed and peer connections were rejected (~50% of attempts).

This PR adds *.etcd-client.{ns}.svc and *.etcd-client.{ns}.svc.cluster.local SANs to the etcd peer certificate, ensuring both possible PTR lookup results match.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-76530

Special notes for your reviewer:

  • The fix triggers a one-time etcd peer certificate regeneration on first CPO reconciliation after deploy
  • The config hash annotation change causes a rolling restart of etcd pods (safe in HA — quorum maintained)
  • This is a HyperShift-specific issue — standalone OCP etcd uses static pods with node IPs, no dual headless services
  • The server cert (ReconcileEtcdServerSecret) already covers both services; the peer cert was the only gap

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Authored by Vimal Solanki

Summary by CodeRabbit

  • Bug Fixes

    • Etcd peer certificates now include additional DNS SANs to improve service discovery and connectivity across the cluster.
  • Tests

    • New unit test coverage validating the peer certificate’s SANs and that the certificate supports both client and server authentication.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 11, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2026
@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 11, 2026
@openshift-ci

openshift-ci Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

Copy link
Copy Markdown

@vsolanki12: This pull request references Jira Issue OCPBUGS-76530, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

Both etcd-discovery and etcd-client headless services select the same etcd pods (app: etcd), causing CoreDNS to register two PTR records per pod IP. The etcd peer certificate only included *.etcd-discovery wildcard SANs, so when Go's getnameinfo() non-deterministically returned the etcd-client PTR record during peer TLS verification, the SAN check failed and peer connections were rejected (~50% of attempts).

This PR adds *.etcd-client.{ns}.svc and *.etcd-client.{ns}.svc.cluster.local SANs to the etcd peer certificate, ensuring both possible PTR lookup results match.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-76530

Special notes for your reviewer:

  • The fix triggers a one-time etcd peer certificate regeneration on first CPO reconciliation after deploy
  • The config hash annotation change causes a rolling restart of etcd pods (safe in HA — quorum maintained)
  • This is a HyperShift-specific issue — standalone OCP etcd uses static pods with node IPs, no dual headless services
  • The server cert (ReconcileEtcdServerSecret) already covers both services; the peer cert was the only gap

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Authored by Vimal Solanki

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

The pull request updates ReconcileEtcdPeerSecret to add two DNS SANs to the etcd peer certificate: *.etcd-client.<namespace>.svc and *.etcd-client.<namespace>.svc.cluster.local. It also adds a unit test TestReconcileEtcdPeerSecret which generates a CA, invokes the reconciler on an empty peer secret, and asserts the resulting certificate contains the expected DNSNames (including discovery and client wildcards plus 127.0.0.1 and ::1) and both client and server extended key usages.

Suggested Reviewers

  • rtheis
  • csrwng
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main change: adding DNS SANs to the etcd peer certificate to fix intermittent peer communication failures caused by non-deterministic PTR record resolution.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR does not use Ginkgo testing framework. It uses standard Go testing with t.Run() and gomega matchers. The test names are static, descriptive, and deterministic with no dynamic data.
Test Structure And Quality ✅ Passed The custom check requires Ginkgo test structure review. However, the test uses standard Go testing with t.Run() subtests, not Ginkgo. The check is not applicable to this test pattern.
Microshift Test Compatibility ✅ Passed The PR adds a standard Go unit test, not Ginkgo e2e tests. The MicroShift check applies only to Ginkgo e2e tests. Not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests added. Only a standard Go unit test (TestReconcileEtcdPeerSecret) that tests internal certificate generation without topology assumptions.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies certificate generation only, adding DNS SANs to etcd peer certs. No scheduling constraints introduced; not applicable to topology check.
Ote Binary Stdout Contract ✅ Passed PR is not an OTE binary test. Changes are to operational code (etcd.go certificate logic) and standard Go unit tests. No process-level code modified. No stdout violations detected.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This is a standard Go unit test, not a Ginkgo e2e test. The custom check only applies to "new Ginkgo e2e tests" (with It(), Describe(), Context(), When() markers). This test does not use Ginkgo.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label May 11, 2026
@openshift-ci

openshift-ci Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vsolanki12
Once this PR has been reviewed and has the lgtm label, please assign muraee for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov Bot commented May 11, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.69%. Comparing base (7625684) to head (2511ea2).
⚠️ Report is 203 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8479      +/-   ##
==========================================
+ Coverage   40.41%   40.69%   +0.28%     
==========================================
  Files         755      755              
  Lines       93235    93368     +133     
==========================================
+ Hits        37679    37999     +320     
+ Misses      52854    52636     -218     
- Partials     2702     2733      +31     
Files with missing lines Coverage Δ
...perator/controllers/hostedcontrolplane/pki/etcd.go 48.27% <100.00%> (+48.27%) ⬆️

... and 18 files with indirect coverage changes

Flag Coverage Δ
cmd-support 34.70% <ø> (+0.26%) ⬆️
cpo-hostedcontrolplane 41.88% <100.00%> (+0.11%) ⬆️
cpo-other 41.39% <ø> (+1.07%) ⬆️
hypershift-operator 50.82% <ø> (+0.09%) ⬆️
other 31.61% <ø> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go`:
- Around line 50-51: The code currently places the IP addresses "127.0.0.1" and
"::1" in the dnsNames slice; move them into an ips slice and pass that as the
ips parameter to reconcileSignedCertWithKeysAndAddresses (leave only DNS entries
in dnsNames). Update the call site that constructs the cert (reference
reconcileSignedCertWithKeysAndAddresses and the dnsNames variable) to pass ips
instead of embedding IPs in DNS, and modify the test in etcd_test.go to assert
cert.IPAddresses contains "127.0.0.1" and "::1" rather than checking
cert.DNSNames for those values.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 07563a75-d22c-4fe1-a6eb-ec3094b49e1e

📥 Commits

Reviewing files that changed from the base of the PR and between e66488b and 7d8ea25.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go

Comment on lines 50 to 51
"127.0.0.1",
"::1",

@coderabbitai coderabbitai Bot May 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check the signature of reconcileSignedCertWithKeysAndAddresses and verify 
# whether IP addresses should be in a separate parameter.

# Find the function definition
ast-grep --pattern $'func reconcileSignedCertWithKeysAndAddresses($$$) {
  $$$
}'

# Also search for how IPAddresses field is populated in certificate generation
rg -n -A5 -B5 'IPAddresses.*=' --type=go

Repository: openshift/hypershift

Length of output: 18256


🏁 Script executed:

# First, examine the etcd.go file around the lines in question
sed -n '40,65p' control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go

# Also look for the function being called and its implementation
rg -n "reconcileSignedCertWithKeysAndAddresses" --type=go control-plane-operator/controllers/hostedcontrolplane/pki/

Repository: openshift/hypershift

Length of output: 2936


🏁 Script executed:

# Look for etcd_test.go
fd -name "etcd_test.go" control-plane-operator/controllers/hostedcontrolplane/pki/

# If found, check what it asserts about the certificate
if [ -f control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go ]; then
  cat control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go
fi

Repository: openshift/hypershift

Length of output: 2462


Move IP addresses from dnsNames to ips parameter to comply with x509 certificate standards.

The IP addresses 127.0.0.1 and ::1 are included in the dnsNames slice, but x509 certificates have separate DNSNames and IPAddresses fields. The reconcileSignedCertWithKeysAndAddresses function explicitly accepts both parameters. Pass the IPs as the ips parameter instead of including them in dnsNames:

dnsNames := []string{
    fmt.Sprintf("*.etcd-discovery.%s.svc", secret.Namespace),
    fmt.Sprintf("*.etcd-discovery.%s.svc.cluster.local", secret.Namespace),
    fmt.Sprintf("*.etcd-client.%s.svc", secret.Namespace),
    fmt.Sprintf("*.etcd-client.%s.svc.cluster.local", secret.Namespace),
}
ips := []string{
    "127.0.0.1",
    "::1",
}

return reconcileSignedCertWithKeysAndAddresses(secret, ca, ownerRef, "etcd-discovery", []string{"kubernetes"}, X509UsageClientServerAuth, EtcdPeerCrtKey, EtcdPeerKeyKey, "", dnsNames, ips, "")

Update the corresponding test assertion in etcd_test.go to check cert.IPAddresses instead of cert.DNSNames for the IP values.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go` around
lines 50 - 51, The code currently places the IP addresses "127.0.0.1" and "::1"
in the dnsNames slice; move them into an ips slice and pass that as the ips
parameter to reconcileSignedCertWithKeysAndAddresses (leave only DNS entries in
dnsNames). Update the call site that constructs the cert (reference
reconcileSignedCertWithKeysAndAddresses and the dnsNames variable) to pass ips
instead of embedding IPs in DNS, and modify the test in etcd_test.go to assert
cert.IPAddresses contains "127.0.0.1" and "::1" rather than checking
cert.DNSNames for those values.

@vsolanki12 vsolanki12 May 12, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid observation, but this predates this PR, the IPs were already in dnsNames before this change. Moving them to ips would be a separate refactor with its own cert regeneration implications.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vsolanki12, that's a fair point — the 127.0.0.1/::1 placement in dnsNames is pre-existing technical debt outside the scope of this fix, and bundling another cert regeneration here would be unnecessarily disruptive.

Would you like me to open a follow-up GitHub issue to track the refactor of moving those IPs into the proper ips parameter of reconcileSignedCertWithKeysAndAddresses? That way it doesn't get lost.


✏️ Learnings added
Learnt from: vsolanki12
Repo: openshift/hypershift PR: 8479
File: control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go:50-51
Timestamp: 2026-05-12T06:51:42.347Z
Learning: In `control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go`, the IP addresses `127.0.0.1` and `::1` are intentionally placed in the `dnsNames` slice (rather than the `ips` parameter of `reconcileSignedCertWithKeysAndAddresses`) as pre-existing behavior that predates PR `#8479`. Moving them to the correct `ips` parameter is a known technical debt but is intentionally deferred to avoid triggering an additional etcd peer certificate regeneration cycle.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vsolanki12 mind opening a jira card to track IPs in dnsNames technical debit?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-76530-etcd-peer-cert-san branch from 7d8ea25 to 26c7fc3 Compare May 12, 2026 06:55

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go`:
- Around line 57-58: The etcd peer cert is currently placing "127.0.0.1" and
"::1" into DNSNames; move those into the IPAddresses parameter when calling
reconcileSignedCertWithKeysAndAddresses by removing them from dnsNames and
passing them as the ips argument (use []string{"127.0.0.1", "::1"}) in the
reconcileSignedCertWithKeysAndAddresses call for "etcd-discovery"; then adjust
the test in etcd_test.go to assert on cert.IPAddresses (using net.ParseIP for
each IP) instead of cert.DNSNames so the test verifies IPs are present in the
certificate's IPAddresses field.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 340ac5a3-0b9f-4714-aa45-843df895f434

📥 Commits

Reviewing files that changed from the base of the PR and between 7d8ea25 and 26c7fc3.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go

Comment on lines +57 to +58
"127.0.0.1",
"::1",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check how ReconcileEtcdPeerSecret handles IP addresses in certificate generation

# Search for the implementation of ReconcileEtcdPeerSecret
ast-grep --pattern $'func ReconcileEtcdPeerSecret($$$) $$$'

# Search for certificate generation calls that might include IPs
rg -A 10 -B 5 "127\.0\.0\.1|::1" --type go --glob '**/pki/etcd.go'

# Look for DNSNames and IPAddresses field assignments
rg -A 5 "DNSNames|IPAddresses" --type go --glob '**/pki/etcd.go'

Repository: openshift/hypershift

Length of output: 2726


🏁 Script executed:

# Find the reconcileSignedCertWithKeysAndAddresses function implementation
ast-grep --pattern $'func reconcileSignedCertWithKeysAndAddresses($$$) $$$'

# Also search for where DNSNames is actually set in the certificate
rg -B 5 -A 15 "DNSNames\s*=" --type go --glob '**/pki/*.go' | head -100

# Look for IPAddresses assignment in certificate generation
rg -B 5 -A 5 "IPAddresses\s*=" --type go --glob '**/pki/*.go'

Repository: openshift/hypershift

Length of output: 1123


🏁 Script executed:

# Find where certs.ReconcileSignedCert is imported from
rg "import.*certs" --type go --glob '**/pki/cert.go' -A 20 | head -40

# Find the certs package and ReconcileSignedCert implementation
find . -name "*.go" -type f -exec grep -l "func ReconcileSignedCert" {} \;

# Search for how it populates certificate fields
rg -B 5 -A 20 "func ReconcileSignedCert" --type go

Repository: openshift/hypershift

Length of output: 1324


🏁 Script executed:

# Get the full ReconcileSignedCert implementation
sed -n '1,300p' support/certs/tls.go | tail -200

Repository: openshift/hypershift

Length of output: 6056


🏁 Script executed:

# Find where the CSR is generated with dnsNames and ips in ReconcileSignedCert
sed -n '/^func ReconcileSignedCert/,/^func [A-Z]/p' support/certs/tls.go | head -100

Repository: openshift/hypershift

Length of output: 2748


🏁 Script executed:

# Get the test file context around lines 57-58
sed -n '35,75p' control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go

Repository: openshift/hypershift

Length of output: 1305


Fix IP address handling in etcd peer certificate generation.

Lines 50–51 in etcd.go incorrectly add IP addresses ("127.0.0.1", "::1") to the dnsNames list. Per X.509 standards, IP addresses must be in the IPAddresses field, not DNSNames. The function signature supports a separate ips parameter for this purpose.

Fix the implementation by passing IPs via the dedicated parameter:

return reconcileSignedCertWithKeysAndAddresses(secret, ca, ownerRef, "etcd-discovery", []string{"kubernetes"}, X509UsageClientServerAuth, EtcdPeerCrtKey, EtcdPeerKeyKey, "", dnsNames, []string{"127.0.0.1", "::1"}, "")

Then update the test assertion to check the correct certificate field:

g.Expect(cert.IPAddresses).To(ContainElements(
    net.ParseIP("127.0.0.1"),
    net.ParseIP("::1"),
))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go`
around lines 57 - 58, The etcd peer cert is currently placing "127.0.0.1" and
"::1" into DNSNames; move those into the IPAddresses parameter when calling
reconcileSignedCertWithKeysAndAddresses by removing them from dnsNames and
passing them as the ips argument (use []string{"127.0.0.1", "::1"}) in the
reconcileSignedCertWithKeysAndAddresses call for "etcd-discovery"; then adjust
the test in etcd_test.go to assert on cert.IPAddresses (using net.ParseIP for
each IP) instead of cert.DNSNames so the test verifies IPs are present in the
certificate's IPAddresses field.

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest hypershift-operator-main

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

2 similar comments
@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@vsolanki12 vsolanki12 marked this pull request as ready for review May 13, 2026 05:51
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 13, 2026
@openshift-ci openshift-ci Bot requested review from enxebre and sdminonne May 13, 2026 05:51
@sdminonne

Copy link
Copy Markdown
Contributor

/retest

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/test e2e-azure-v2-self-managed

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-76530-etcd-peer-cert-san branch from 26c7fc3 to 4b1d450 Compare May 14, 2026 15:37
@vsolanki12

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot

Copy link
Copy Markdown

@vsolanki12: This pull request references Jira Issue OCPBUGS-76530, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 15, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@vsolanki12: This pull request references Jira Issue OCPBUGS-76530, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: geliu2016.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

@vsolanki12: This pull request references Jira Issue OCPBUGS-76530, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

fmt.Sprintf("*.etcd-discovery.%s.svc", secret.Namespace),
fmt.Sprintf("*.etcd-discovery.%s.svc.cluster.local", secret.Namespace),
fmt.Sprintf("*.etcd-client.%s.svc", secret.Namespace),
fmt.Sprintf("*.etcd-client.%s.svc.cluster.local", secret.Namespace),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we please add a // comment on the reasoning why this is needed so it's clear for humans/agents landing here without needing to trace back the PR desc?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @enxebre , I have added the comment as per the suggestion

@enxebre

enxebre commented May 15, 2026

Copy link
Copy Markdown
Member

why this vs making the client SVC clusterIP? would that work?

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-76530-etcd-peer-cert-san branch from 4b1d450 to 4b8f67e Compare May 15, 2026 08:26
@vsolanki12

vsolanki12 commented May 15, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @enxebre ,
The etcd-client service is headless and clients connecting to it get the individual pod IPs directly, which is important for etcd client load balancing and direct pod connectivity. A regular ClusterIP service would route through kube-proxy, adding a layer of indirection that may not be desirable for etcd's latency sensitive traffic. Additionally, changing the service type would be a major change affecting all etcd clients, while adding SANs is a minimal, targeted fix that only affects the peer certificate.

@sdminonne sdminonne left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a solid fix for a real bug, Thanks! Unfortunately I've to ask you to track the current erroneous behavior with a JIRA card and a comment to reference it

Comment on lines 50 to 51
"127.0.0.1",
"::1",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vsolanki12 mind opening a jira card to track IPs in dnsNames technical debit?

"*.etcd-client.clusters-test.svc",
"*.etcd-client.clusters-test.svc.cluster.local",
"127.0.0.1",
"::1",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is encoding the behavior that should be fixed. Mind after creating the JIRA card reference it here in a comment with a TODO

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sdminonne ,
I have added the TODO comment and raised the JIRA card(OCPBUGS-86648) as per the suggestion.

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-76530-etcd-peer-cert-san branch from 4b8f67e to c9fcd75 Compare May 28, 2026 06:16
Both etcd-discovery and etcd-client headless services select the same
etcd pods, causing CoreDNS to register two PTR records per pod IP.
The etcd peer certificate only included *.etcd-discovery SANs, so when
Go's getnameinfo() non-deterministically returned the etcd-client PTR
record, the TLS SAN check failed and peer connections were rejected.

Add *.etcd-client wildcard SANs to the peer certificate so both
possible PTR lookup results match.
@vsolanki12 vsolanki12 force-pushed the OCPBUGS-76530-etcd-peer-cert-san branch from c9fcd75 to 1ccc561 Compare May 28, 2026 08:11
Both etcd-client and etcd-discovery headless services select the same
etcd pods, causing dual PTR records per pod IP. Converting etcd-client
to a regular ClusterIP service eliminates the duplicate PTR records at
the root cause, complementing the SAN-based fix.

Co-Authored-By: Cesar Wong <[email protected]>
Signed-off-by: Vimal Solanki <[email protected]>
@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

@vsolanki12: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

Copy link
Copy Markdown

Now I have all the evidence I need. The analysis is conclusive. Let me compile the final report.

Test Failure Analysis Complete

Job Information

  • Prow Job: Red Hat Konflux / control-plane-operator-enterprise-contract / control-plane-operator-main
  • Build ID: control-plane-operator-enterprise-contract-nbcb8
  • Pipeline Namespace: crt-redhat-acm-tenant
  • Snapshot: control-plane-operator-20260602-141057-000
  • Run Window: 2026-06-02T14:00:26Z → 2026-06-02T14:27:47Z (~27 min)

Test Failure Analysis

Error

Integration test for component control-plane-operator-main snapshot
control-plane-operator-20260602-141057-000 and scenario
control-plane-operator-enterprise-contract has failed

PipelineRun: control-plane-operator-enterprise-contract-nbcb8
Task results table: EMPTY — zero tasks executed (expected: collect-keyless-params, verify)

Summary

This is a Konflux infrastructure failure, not a code or policy violation. The Enterprise Contract pipeline run started but its two tasks (collect-keyless-params and verify) never executed — the GitHub check run's task results table is completely empty. All four other EC checks on the same commit completed normally with tasks visible (256–274 policy checks passing), and the container image build itself (control-plane-operator-main-on-pull-request) succeeded. The same empty-table EC failure pattern occurred on PR #8642, which was merged the same day. All recent PRs (#8711#8717) pass this check. The PR's code changes (adding etcd-client SANs to peer certificates and switching etcd-client to ClusterIP) are unrelated to the Konflux pipeline infrastructure.

Root Cause

The Konflux Tekton pipeline run control-plane-operator-enterprise-contract-nbcb8 experienced a task scheduling or initialization failure within the stone-prd-rh01 cluster. The pipeline was marked as failure but produced no task-level results whatsoever.

A normal successful run of this pipeline (visible on recent PRs #8711#8717) executes two tasks:

  1. collect-keyless-params (~10s) — collects Sigstore keyless signing parameters
  2. verify (~25s) — runs ~500+ Enterprise Contract policy checks against the built image

On the failed run, neither task appears in the results table, indicating the failure occurred at the Tekton PipelineRun controller level — before any task pods were scheduled. This is characteristic of transient Konflux infrastructure issues such as:

  • Tekton controller pod restart or resource pressure
  • PipelineRun admission/scheduling timeout
  • Namespace resource quota exhaustion in crt-redhat-acm-tenant

Key differentiator: This is NOT a policy violation. When an EC policy check fails, the verify task still appears in the table with ❌ results. Here, the table is empty — the pipeline never reached the policy evaluation stage.

The EC check has no branch protection requirement — the required_status_checks for the main branch are empty. This check is informational only; Prow CI gates mergeability.

Recommendations
  1. No action needed on the PR — The etcd peer certificate SAN and ClusterIP service changes in PR OCPBUGS-76530: Fix intermittent etcd peer communication failures #8479 are unrelated to this infrastructure failure. All unit tests, Prow CI checks, and the container image build passed.

  2. Retry via /retest — A fresh Konflux pipeline run will almost certainly succeed. The same check passes consistently on current PRs (CNTRLPLANE-3021: Multi-stream CoreOS boot image parsing in releaseinfo #8711CNTRLPLANE-3603: Migrate clients from CAPI v1beta1 to v1beta2 #8717 all show success as of today).

  3. Merge is not blocked — The Konflux EC check is not a required status check in branch protection rules. PR CNTRLPLANE-3535: Add codecov carryforward flags to stabilize project coverage checks #8642 was merged on the same day (2026-06-02) with an identical EC failure pattern on the hypershift-operator-enterprise-contract check.

  4. If failure persists after retry — Report to the Konflux team responsible for the crt-redhat-acm-tenant namespace as a Tekton PipelineRun scheduling issue in the stone-prd-rh01 cluster.

Evidence
Evidence Detail
Failed pipeline run control-plane-operator-enterprise-contract-nbcb8 — task results table is empty (0 tasks executed)
Other EC checks (same commit, same time) 4/4 completed normally: hypershift-operator-main-enterprise-contract (256 successes, 22 warnings), hypershift-operator-enterprise-contract (256 successes, 22 warnings), enterprise-contract-mce-50/hypershift-cli-mce-50 (273 successes), enterprise-contract-mce-50/hypershift-release-mce-50 (274 successes)
Image build check control-plane-operator-main-on-pull-requestsuccess (14:20:45 UTC)
PR #8642 same-day precedent hypershift-operator-enterprise-contractfailure with identical empty task table; PR was merged at 11:27:27 UTC
Recent PRs passing this check PR #8717 (success, Jun 11), PR #8715 (success, Jun 11), PR #8713 (success, Jun 11), PR #8711 (success, Jun 11)
Branch protection required_status_checks is empty — EC check is not a merge gate
Check not retried Only one run exists (Jun 2); no /retest has been issued since

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants